
Introduction

Natural language processing (NLP) systems vary in their goals, and as such vary in what they require from the lexicon. The computational lexicon is the fundamental repository of information about the primary component of language, i.e. words, and is therefore critical for systems which aim to handle some aspect of natural language. Two key issues for the lexicon in NLP tasks are lexical representation and lexical acquisition. This chapter will consider the computational issues resulting from these two lexical aspects, focusing primarily on the challenges that meaning ambiguity poses. Polysemy can have a significant impact on the performance of various NLP systems, given the potential unboundedness of sense variation. This impact is, however, task dependent: some NLP tasks do not require detailed sense discrimination and can effectively make use of lexical entries in which the meaning of a word is underspecified (Kilgarriff 1992, 1997b). Underspecification in the lexicon must therefore be balanced against the needs of the task.

For tasks which require rigorous meaning interpretation, the choice of lexical representation must be carefully considered. It is important to decide on a representation which will not only provide the basis for effective sense disambiguation, but which will also be easy to maintain and update. This means that redundancy must be avoided and generalisations must be captured as far as possible. The most straightforward means of achieving these goals is through the use of an inheritance hierarchy which organises lexical items into groups and allows information relevant for a set of items to be stated once, at a node superior to all the items of the set. When adding a new lexical entry given such a structure, one need only state its type and the properties which are peculiar to that particular word (relative to its type). Indeed, this fact has informed my representational choices in this thesis.
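
As a rough illustration of this idea, the following minimal Python sketch models a single-inheritance hierarchy in which a property is stated once at a type node and looked up from the entries below it. It is not the formalism used in this thesis: the class name LexNode, the lookup method, and the types and features shown (noun, count-noun, cat, countable, orth) are all hypothetical, and the rule that the nearest stated value wins is an assumption made only for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LexNode:
    """A node in a single-inheritance lexical hierarchy (hypothetical design)."""
    name: str
    parent: Optional["LexNode"] = None
    # Properties stated at this node only; everything else is inherited.
    local: dict = field(default_factory=dict)

    def lookup(self, attr):
        """Return the value of attr from this node or its nearest ancestor."""
        node = self
        while node is not None:
            if attr in node.local:
                return node.local[attr]
            node = node.parent
        raise KeyError(attr)


# Illustrative types and features; none of these are taken from the thesis.
noun = LexNode("noun", local={"cat": "N"})
count_noun = LexNode("count-noun", parent=noun, local={"countable": True})
mass_noun = LexNode("mass-noun", parent=noun, local={"countable": False})

# A new entry states only its type and what is peculiar to the word itself.
book = LexNode("book", parent=count_noun, local={"orth": "book"})
rice = LexNode("rice", parent=mass_noun, local={"orth": "rice"})

print(book.lookup("cat"), book.lookup("countable"))
print(rice.lookup("cat"), rice.lookup("countable"))
```

In this sketch the category is stated once, at the noun node, and both entries recover it by lookup, so changing it there updates every noun at once; this is the maintenance benefit described above.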

The acquisition of lexical representations is a critical issue for NLP. A common criticism of NLP systems is that they are often "toy" implementations which would not scale up well to the task of general language understanding of unrestricted texts. This is clearly true of systems which rely on hand-coding of the lexicon, because increasing lexical coverage by hand is so arduous. This factor has led to the search for automated techniques for lexicon acquisition. Initial attempts in this direction were made using Machine Readable Dictionaries (MRDs), and more recently corpora have been consulted as a basis for lexicon development. I will give an overview of research of both types, and will argue that neither results in an adequate representation. I will instead argue for an approach to computational lexicon acquisition which combines top-down linguistic design and bottom-up corpus evidence, and will discuss what a corpus would need to look like in order to give us the relevant information. Although such a corpus is not available today, the desiderata I will put forth point to future research in corpus design and establish a framework in which the lexical representations I utilise in this thesis could in future serve as a basis for larger-scale NLP systems.

